A Dual Module Approach for High Accuracy Phishing Detection and Email Prioritization using NLP and Machine Learning

Authors: Kukatla Sai Bharavi, Dr. B. Kranthi Kiran

DOI Link: https://doi.org/10.22214/ijraset.2025.73708

Abstract

Phishing attacks are a serious and ongoing problem in digital communication, where attackers exploit weaknesses to obtain sensitive information. Traditional defenses often struggle to keep up with attackers' changing methods, which creates a need for smarter and more flexible solutions. This paper presents a two-module system designed to improve email security and help people manage their emails more efficiently. The first module uses Natural Language Processing features and machine learning to decide whether an email is phishing. The second module takes the emails that are not phishing and assigns each a priority based on its content and context. To test the system, we used two datasets: a standard spam/ham benchmark and a custom dataset we created to represent real-life situations. In both cases, the system performed well in identifying phishing emails and setting the right priorities. The phishing detection module achieved an accuracy of 97.97% on the Standard Spam/Ham Dataset and 82.22% on the Custom-Built Dataset, while the email prioritization module achieved 99.71% and 98.89% respectively, even when the data contained some errors. These results show that the system is robust and could be a good solution for improving email security and management today.

1. Introduction

Phishing remains a major cybersecurity threat, and attackers use increasingly advanced and emotionally manipulative tactics. Traditional detection methods (e.g., rule-based filters or keyword blacklists) struggle to keep up with these tactics. To address this, the paper proposes a two-part system:

Phishing Detection Module
Email Prioritization Module

The goal is to improve email security and help users focus on important emails, using Machine Learning (ML) and Natural Language Processing (NLP).

2. Literature Review

Early approaches relied on rule-based filters and blacklists, which failed to adapt to evolving threats.
ML techniques (e.g., SVMs, Decision Trees) improved detection by analyzing email features.
NLP advancements (e.g., BERT) added deeper contextual understanding to detect sophisticated attacks.
Recent research has also focused on smart email management using automation (e.g., RPA), but few systems combine threat detection and productivity enhancement. This work aims to bridge that gap.

3. Methodology

The system architecture has three core components:

User Interface: Visual dashboard with email status and priority indicators.
Detection Engine: Scans emails for phishing threats and assigns priority levels.
Response Mechanism: Automatically quarantines threats and helps users respond appropriately.

A. Phishing Detection Module

Uses basic NLP preprocessing (tokenization, stop-word removal, etc.).
Applies Logistic Regression for binary classification (phishing or non-phishing).
Learns phishing patterns from labeled datasets.
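
The steps listed for this module can be sketched with scikit-learn, the library named in the Implementation section. This is a minimal illustration, not the authors' code: the TF-IDF features, the built-in English stop-word list, the max_iter setting, and the six toy emails are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration (1 = phishing, 0 = legitimate).
train_texts = [
    "Verify your account now or it will be suspended",
    "Your password expires today, click this link to reset",
    "You have won a prize, confirm your bank details",
    "Meeting moved to 3 pm, see updated agenda attached",
    "Here are the lecture notes from yesterday",
    "Lunch on Friday? Let me know what works",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Assumption: TF-IDF features. The vectorizer lowercases, tokenizes, and drops
# English stop words; LogisticRegression performs the binary classification.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

new_emails = [
    "Click here to confirm your account password",
    "Agenda for the project meeting on Friday",
]
predictions = model.predict(new_emails)          # array of 0/1 labels
probabilities = model.predict_proba(new_emails)  # columns follow model.classes_
```

Keeping the vectorizer and classifier in one pipeline means the same preprocessing is applied at training and prediction time, which matters when the module is evaluated on a second, noisier dataset.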

B. Email Prioritization Module

Classifies safe emails into High, Medium, or Low priority.
Uses keyword analysis (e.g., "urgent", "deadline") and topic modeling to assess importance.
Outputs a score via ML classification for user-friendly organization.
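
The keyword-analysis idea can be illustrated with a small rule-based scorer. This is a deliberate simplification and not the paper's method, which combines keywords with topic modeling and an ML classifier. Only "urgent" and "deadline" come from the paper; the other keywords, the weights, and the thresholds are hypothetical.

```python
import re

# Hypothetical keyword weights; only "urgent" and "deadline" appear in the paper.
KEYWORD_WEIGHTS = {"urgent": 3, "deadline": 2, "asap": 2, "reminder": 1}


def priority_score(text: str) -> int:
    """Sum the weights of the priority keywords that occur in the email."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in tokens)


def priority_label(text: str, high: int = 3, medium: int = 1) -> str:
    """Map the keyword score onto High / Medium / Low using assumed thresholds."""
    score = priority_score(text)
    if score >= high:
        return "High"
    if score >= medium:
        return "Medium"
    return "Low"


print(priority_label("URGENT: contract deadline is tomorrow"))   # High
print(priority_label("Friendly reminder about the team lunch"))  # Medium
print(priority_label("Photos from the weekend trip"))            # Low
```

In a learned version of this module, scores like these would more likely serve as input features to a classifier than as the final decision.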

C. Datasets

Standard Spam/Ham Dataset: Clean, labeled benchmark emails.
Custom-Built Dataset: Noisy, real-world emails with imperfect labels to simulate realistic conditions.

4. Implementation

Developed using Python and run on Google Colab.
Libraries used:
- scikit-learn: ML models (logistic regression)
- pandas: Data handling
- NLTK: Text preprocessing
NLP techniques prepare email content for classification and prioritization.
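
As a rough illustration of the text preparation step, the sketch below lowercases an email, tokenizes it, and removes stop words using only the Python standard library. The small stop-word set is an assumption standing in for NLTK's English stop-word list, and the regular-expression tokenizer stands in for an NLTK tokenizer.

```python
import re

# Small illustrative stop-word set (assumption); NLTK's English list would normally replace it.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and", "or",
    "your", "you", "in", "on", "for",
}


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Your account is locked. Click the link to verify!"))
# ['account', 'locked', 'click', 'link', 'verify']
```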

5. Results

Two experiments were conducted:

1. Standard Dataset Results

Phishing Detection Accuracy: 97.97%
Email Prioritization Accuracy: 99.71%
High performance in clean conditions.

2. Custom Dataset Results

Phishing Detection Accuracy: 82.22%
- Significant drop due to sophisticated phishing tactics and noisy data.
Email Prioritization Accuracy: 98.89%
- Maintained high accuracy, even with messy inputs.
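
For reference, accuracy here is the fraction of emails whose predicted label matches the true label. The short sketch below shows that computation on invented toy labels and then uses the reported figures to quantify the gap between the two datasets in percentage points. The helper name accuracy is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration: 4 of 5 predictions are correct.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8

# Accuracies reported in this section, in percent.
detection = {"standard": 97.97, "custom": 82.22}
prioritization = {"standard": 99.71, "custom": 98.89}

# Drop from the standard to the custom dataset, in percentage points.
detection_drop = round(detection["standard"] - detection["custom"], 2)
prioritization_drop = round(prioritization["standard"] - prioritization["custom"], 2)
print(detection_drop)       # 15.75
print(prioritization_drop)  # 0.82
```

The phishing detector loses 15.75 percentage points on the custom dataset, while the prioritization module loses less than one.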

6. Key Insights & Future Work

The system performs well overall, especially in organizing important emails.
Phishing Detection needs improvement in real-world, noisy environments.
Future work includes integrating advanced models like BERT to better capture context and deception in phishing emails.

Conclusion

This paper presents a dual-module system aimed at solving the connected problems of email security and effective communication management. Using natural language processing and machine learning, the system combines phishing detection with email prioritization in one unified pipeline. To understand its capabilities, we put it through two tests: one on a standard dataset and one on a specially created dataset that mimics real email situations. The phishing detection module reached 97.97% accuracy on the Standard Spam/Ham Dataset and 82.22% on the Custom-Built Dataset, and the email prioritization module performed very well, reaching 99.71% on the Standard Spam/Ham Dataset and 98.89% on the Custom-Built Dataset. The results show that combining security and usability in email handling is not only possible but also very useful. The success of the prioritization module, along with the solid performance of the phishing filter, shows that even simple models can deliver large benefits when the features are well chosen and the system is built modularly.

Copyright

Copyright © 2025 Kukatla Sai Bharavi, Dr. B. Kranthi Kiran. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET73708

Publish Date : 2025-08-16

ISSN : 2321-9653

Publisher Name : IJRASET
